ABSTRACT

We describe MARIE-1 and MARIE-2, information retrieval systems for multimedia
data. They exploit captions on the data and perform natural-language processing of them and
English retrieval requests. Some content analysis of the data is also performed to obtain
additional descriptive information. The key to getting this approach to work is sufficiently
fast processing. We achieve this by decomposing the problem into "information filters" and
applying a new theory of optimal information filtering which we have developed.

This work was sponsored by DARPA as part of the I3 Project under AO 8939, and by the U. S. Naval Postgraduate
School under funds provided by the Chief for Naval Operations. Discussions with Amr Zaky improved
this paper.



1. Introduction 



The MARIE project has been investigating information retrieval of multimedia data using a new idea: putting
primary emphasis on caption processing. Even though content analysis methods such as substring searching for
text media and shape matching for picture media can obviate captions, content analysis usually requires
unacceptably large amounts of time at retrieval time. Captions can be thought of as cachings of the results of
content analysis, created either manually by a user describing a multimedia datum, automatically by computerized
content analysis, or by some combination of both; but they can also include auxiliary information like the date
or customer for a photograph. Since captions can be considerably smaller than the media data they describe,
checking captions before retrieving media data can save time if it can rule out many bad matches quickly. In
other words, caption information can be passed through fast "information filters" [1] to rule out media data
irrelevant to a user's needs.

However, caption processing does not necessarily give faster multimedia retrieval. The terms of the caption may
be synonyms or subterms of those supplied by a user during retrieval, in which case a complete thesaurus
of synonyms and a complete type hierarchy covering more general and more specialized terms should be available
for use when matching the caption during information retrieval [21]. Furthermore, to obtain high query recall and
precision, user-supplied captions should be subject to natural-language processing to determine the correct word
senses and how the words relate, to get beyond the limits of keyword matching on the caption [11]. These additional
processing needs can make caption processing slow. So the MARIE project is concerned with methods of
improving the efficiency of the caption-based approach to information retrieval. This paper reports on three
important directions that we have explored recently: an efficient statistical parser for natural language, special
content-analysis methods, and using sampled parameters to find the optimal execution strategy for retrievals.

While the MARIE project is intended for multimedia information retrieval in general, we have used as a testbed
the Photo Lab of the Naval Air Warfare Center (NAWC-WD), China Lake, California, USA. This is a library of
approximately 100,000 pictures and 37,000 captions for those pictures. The pictures cover all activities of the
center, including pictures of equipment, tests of equipment, administrative documentation, site visits, and public
relations. With so many pictures, many of which look virtually identical, captions are indispensable to find what
a user is looking for. But the existing computerized keyword system for finding pictures from their captions is
unhelpful, and is mostly ignored by personnel. [17] reports on MARIE-1, a prototype implementation that we
developed for them, a system that appears much more in the direction of what users want. Figure 1 shows an
example retrieval from MARIE-1, for the query "side view of an F-18 aircraft flying loaded with missiles": 12
pictures were found (with fits ranging from 5.0 to 8.0), three of which are displayed in the bottom right with
their associated registration information, and the top of the upper left box shows the semantic interpretation
constructed for the query. MARIE-1 took a man-year to construct and only handled 220 pictures from the database.
To handle the full database, efficiency and implementation-difficulty concerns become paramount. MARIE-2,
currently under development, will address these.



2. Statistical natural-language parsing 

Some natural-language processing beyond keyword matching seems important for visual and audio multimedia
because relationships between components are more important for them than for most documents. For instance,
users should demand that "tank target" not match just any caption mentioning "tank" and "target", nor
"steel airplane propeller" match a caption mentioning "steel", "airplane", and "propeller" separately, nor "missile
on din" match "din on missile". Similarly, users should expect type and part-whole hierarchies to be used, so
"closeup of wing markings" should match "view of wing". To permit such reasonable behavior, we will need to
do parsing and some semantic interpretation of each caption and query.

MARIE-1 uses the standard approach of intelligent natural-language processing for information retrieval [9, 13,
19] of hand-coding lexical and semantic information for the words in a narrow domain. This approach would
be laborious and near-unworkable for the 32,000 distinct words in the 100,000-caption NAWC database. But a
new approach to natural-language processing has emerged in the last few years, statistical parsing. It assigns
probabilities of co-occurrence to sets of words, and uses these probabilities to guess the most likely interpretation
of a sentence. The probabilities can be derived from statistics on a corpus, a representative set of example
sentences, and they can capture fine semantic distinctions that would otherwise require additional lexicon
information. Statistical parsing seems an excellent way to implement MARIE-2 since it replaces invocation of many
laboriously handcrafted semantic routines with a few simple and fast calculations on statistics automatically
acquired from a corpus with many similar sentences.

Statistical parsing is especially well suited for information-retrieval applications because they already have a
statistical aspect: they find data that is probable, but not guaranteed, to satisfy a user. Also, good information
retrieval does not require the full natural-language understanding that hand-tailored semantic routines provide:
understanding of the words involved in matching is not generally helpful for matching beyond the synonym and
hierarchical type and part information for those words. For instance, the query "missile mounted on aircraft"
should match all three of "Sidewinder on F-18", "Sidewinder attached to wing pylon", and "Pylon mounted
AIM-9M Sidewinders" since "Sidewinder" and "AIM-9M" are types of missiles, "F-18" is a kind of aircraft, and
"on" and "attached" mean the same thing as "mounted". In fact, the MARIE-1 captions were often very imprecise
with verbs, so that detailed semantic analysis of verbs and their cases in captions was unhelpful. Parsing is
still essential to connect related words in a caption, so as to recognize that the three examples above have the
same deep semantic structure. But for information-retrieval applications, this parser can be simpler than one
required for full natural-language understanding, with fewer grammatical categories and fewer rules.

Creating the full synonym list, type hierarchy, and part hierarchy for applications of the size of the NAWC-WD
database (32,000 words) is some work. Fortunately, most of this job for any English application has already
been accomplished in the Wordnet system [12], a large thesaurus system that includes this information
plus rough word frequencies and morphological processing. We converted its information for the NAWC-WD
words into a Prolog format compatible with the rest of MARIE-2, and used this as our lexicon for parsing and
interpretation. So the basic meaning assigned to a noun or verb is that it is a subtype of the concept designated
by its name in the type hierarchy, with additional pieces of meaning added by its relationships (like
modification) to other words in the sentence. Wordnet also includes extensive lists of synonyms; using the
rough word-frequency information, we designated the most common one of each synonym set as the "standard
alias", and stored only the type and part pointers for this word, which considerably shortens the lexicon.



2.1. Statistical parsing techniques 



This approach can mean fast processing since we just append the type and relationship specifications for all the
words in a sentence, resolving references using the parse tree, to obtain a "meaning list" or semantic graph,
following the paradigm of [6]. But this can still be slow because we need to find all the reasonable interpretations
of a sentence in order to rank them, and most sentences have multiple interpretations. To simplify matters, we
restricted the grammar to binary parse rules (context-free grammar rules with only one or two symbols for the
replacement). Then the likelihood of an interpretation can be found by assigning probabilities to word senses
and grammar rules. If we could assume near-independence of the probabilities of each part of the sentence, we
could multiply them to get the probability of the whole sentence [8]. This is mathematically equivalent to taking
the sum of the logarithms of the probabilities, and hence a branch-and-bound search could be done to
quickly find the N best parses of a sentence.
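
The scoring under the near-independence assumption can be written out as follows. The symbols $p_i$ (the
probability attached to the i-th word sense or grammar rule used in a parse $T$) and $k$ (the number of such
parts) are notation introduced here for illustration only.

$$P(T) \approx \prod_{i=1}^{k} p_i \quad\Longleftrightarrow\quad \log P(T) \approx \sum_{i=1}^{k} \log p_i$$

Since every $\log p_i \le 0$, the sum over the parts chosen so far is an upper bound on the score of any complete
parse that extends them. That bound is what would let a branch-and-bound search discard a partial parse once it
falls below the score of the N-th best complete parse found so far.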

But words of sentences are obviously not often independent or near-independent. Statistical parsing often
exploits the probabilities of strings of successive words in a sentence [10]. However, with our binary parse
rules, a simpler and more semantic approach is to consider only the probability of co-occurrence of the two
subparses in the binary rule. For example, in parsing "F-18 landing" by the rule "NP -> NOUN PARTICIPLE", the
probability assigned to this rule should reflect the likelihood of an F-18 in particular doing a landing in addition
to the probability of using this rule. The co-occurrence probability for "F-18" and "land" is especially helpful
because it is unexpectedly large, since there are only a few things in the world that land. Estimates of
co-occurrence probabilities can be inherited through the type hierarchy [14]. So if we have insufficient statistics
in our corpus about how often an F-18 lands, we may have enough on how often an aircraft lands; and assuming that
F-18s are typical of aircraft in this respect, we can estimate how often F-18s land. The second word can separately
be generalized too, so we can use statistics on "F-18" and "moving", or both the words can be simultaneously
generalized, so we can use statistics on "aircraft" and "moving". The objective should be to find some statistics
that can be reliably used to estimate the co-occurrence probability of the words.
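
The generalization described above can be sketched in a few lines of Python. This is only an illustration under
stated assumptions, not the MARIE-2 implementation: the toy hierarchy, the counts for "vehicle", "land", and
"move", the `MIN_COUNT` threshold, and the order in which generalizations are tried are all hypothetical. The
scaling step assumes each word is typical of the superconcept that replaces it.

```python
# Hypothetical toy data: hierarchy and counts are illustrative only.
PARENT = {"F-18": "aircraft", "aircraft": "vehicle", "land": "move"}
UNARY = {"F-18": 10, "aircraft": 1000, "vehicle": 5000, "land": 400, "move": 2000}
PAIR = {("aircraft", "land"): 230, ("vehicle", "move"): 900}
MIN_COUNT = 5  # assumed threshold for "enough" statistics


def ancestors(word):
    """Yield the word itself, then its successive superconcepts."""
    while word is not None:
        yield word
        word = PARENT.get(word)


def estimate_pair_count(first, second):
    """Estimate count(first, second), generalizing either word as needed.

    The most specific pair with at least MIN_COUNT occurrences is used, and
    its count is scaled down by count(word) / count(superconcept) for each
    word that had to be generalized.
    """
    for g1 in ancestors(first):
        for g2 in ancestors(second):
            count = PAIR.get((g1, g2), 0)
            if count >= MIN_COUNT:
                scaled = count * UNARY[first] / UNARY[g1]
                scaled = scaled * UNARY[second] / UNARY[g2]
                return scaled, (g1, g2)
    return 0.0, None


estimate, used_pair = estimate_pair_count("F-18", "land")
```

With these toy counts, no statistics exist for "F-18" with "land" or "move", so the sketch falls back to the pair
("aircraft", "land") and scales its count by the share of aircraft mentions that are F-18s.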

To keep the number of possible co-occurrence probabilities manageable, it is important to restrict them to
two-word probabilities. When parse rules recognize multiword sequences as grammatical units, those sequences can
be reduced to "headwords". For instance, "the big F-18 from China Lake landing at Armitage Field" can be parsed
by the rule "NP -> NP PARTP" and the same co-occurrence probability used, since "F-18" is the principal noun
and hence headword of the noun phrase "the big F-18 from China Lake", and "landing" is the participle and
hence headword of the participial phrase "landing at Armitage Field".

The statistical database for binary co-occurrence statistics will need careful design because the data will be
sparse and there will be many small entries. For instance, for the NAWC-WD captions with 32,000 possible
words and 9,000 superconcepts and aliases of those words, there are 26,000 distinct lexicon entries after
equivalent aliases are removed and all word senses are included. This means 343 million possible co-occurrence
pairs, but the total of all their counts can only be 605,000, the total number of word instances in all the captions.
Our database uses four search trees indexed on the first word, the part of speech + word sense of the first word,
the second word, and the part of speech + word sense of the second word; it stores the count for that word pair.
It is important to store counts rather than probabilities to save storage and reduce work on update. Various
compression techniques can further reduce the size of this database, but one in particular is especially useful:
elimination of data that can be closely approximated from other counts [14] using sampling theory. For
instance, if "F-18" occurs 10 times in the corpus, all kinds of aircraft occur 1000 times, and there are 230
occurrences of aircraft landing, estimate the number of "F-18 landing"s in the corpus as 230 * 10 / 1000 = 2.3; if
the actual count is within a standard deviation of the value, do not store it in the database. The standard deviation
when n is the size of the subpopulation, N is the size of the population, and A the count for the population, is
$\sqrt{A(N-A)(N-n)/(nN^{2}(N-1))}$ [4]. Such calculations also require "unary" counts stored with each word or
standard phrase, but there are far fewer of these. (While unary counts also directly affect the likelihood of a
particular sentence, that effect can be ignored in judging different interpretations of a sentence since it is constant.)
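
The storage test described above can be sketched as follows. This is a minimal illustration, not the authors'
code: the function names are hypothetical, and the spread is computed with the standard finite-population
(hypergeometric) standard deviation of a count, $\sqrt{nA(N-A)(N-n)/(N^{2}(N-1))}$, which is n times the standard
deviation of the corresponding fraction. Treating the subpopulation as a random sample of the population is the
assumption that makes this formula applicable.

```python
import math


def expected_count(n, N, A):
    """Expected count in the subpopulation if it is typical: A * n / N."""
    return A * n / N


def count_std(n, N, A):
    """Standard deviation of that count when n items are drawn without
    replacement from N items, A of which have the property."""
    return math.sqrt(n * A * (N - A) * (N - n) / (N ** 2 * (N - 1)))


def should_store(actual, n, N, A):
    """Keep a pair count only if it lies more than one standard deviation
    from what the more general counts already predict."""
    return abs(actual - expected_count(n, N, A)) > count_std(n, N, A)


# The example from the text: 10 F-18s, 1000 aircraft, 230 aircraft landings.
predicted = expected_count(10, 1000, 230)
spread = count_std(10, 1000, 230)
```

For the numbers in the text the predicted count is 2.3 with a spread of roughly 1.3, so under this sketch an
observed count of 3 would be dropped from the database while an observed count of 5 would be kept.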

3. Integrating content analysis 

Another way to obtain descriptive caption information for a multimedia datum is to analyze its content directly, 
as in [2, 5]. For text data this can be parsing and summarization, but for pictures, audio, and video it is more 
complex. Audio can be reduced to a picture by a Fourier transform, and video can be converted into a sequence
of still pictures. Thus the central problem for content analysis of multimedia is one of recognizing and classifying
shapes in a two-dimensional picture. For instance, aircraft in NAWC-WD photographs are usually the only
objects with four bumps in two symmetrical pairs; even if the caption doesn't say so, such a shape should be
considered evidence of an aircraft. We developed some powerful domain-independent picture processing
methods in [18]; additional domain-dependent knowledge is also needed to classify shapes. Then qualitative
relationships between the shapes can be determined. The shape and relationship facts can be collected as a
visual summary of the picture, and this can be merged with explicit textual caption information.

Content analysis of pictures can be complex because interesting ones (or audio or video) can contain many
different shapes and relationships between them. The work may be done when multimedia data are added to the
databases, and different processors can work on different parts of the picture simultaneously to get results faster.
To avoid creating unwieldy captions, the amount of such information can be limited to that for the highest-priority
shapes (like aircraft for the NAWC-WD pictures, or the long sounds in [18]). Alternatively, we can
store only information about regions mentioned in the caption, but this requires that we relate the caption graph
and content-analysis graph. In general, the caption graph, excluding nondepictable concepts like "view", "test", and
dates, will be a subgraph of the content-analysis graph, and a subgraph isomorphism problem must be solved to
merge the two into a single graph. The subgraph isomorphism problem is NP-complete in general, although this
application of it provides a variety of special heuristics to exploit. But the resulting consensus graph will
provide better picture-description information than either graph alone.

Just as captions have linguistic foci, pictures that depict have visual foci, something not true of pictures in
general. That is, if a picture is to be considered a "good" depiction of something, and worth storing in a
multimedia library, the object(s) depicted usually can be inferred from the picture alone. However, photography is
a less precise enterprise than entering captions because photographs sometimes must be taken in a hurry, and the
best angle to the subject or best distance from the photographic subject is not always possible, and it is also
much harder to "edit" the results. So visual focus can only be established by a set of factors that positively
correlate with it.



We have identified six major factors that can be applied to the regions identified in a picture to rate how likely a
region or set of contiguous regions is to be a visual focus. First, a visual focus tends to be a big region or set of
regions (with exceptions for photographs illustrating the context of some subject). Second, a visual focus tends
to be surrounded by a strong edge, or clear discontinuity in brightness, color, or texture. Third, a visual focus
tends to be either a uniform color or color mix, although its brightness may vary considerably. Fourth, a visual
focus tends not to touch the boundary of the picture, though large objects can touch a little (with one major
exception: people and some animals are generally considered depicted if their faces are depicted). Fifth, a visual
focus has its center of mass close to the center of the photograph. Sixth, there are few other regions or region
clusters having the same properties as the visual focus (with exceptions for some natural pictures like those of
flowers in a field).

So early visual processing should be adjusted, in thresholds and in the techniques used, to find such a region or
regions, using parameters for textural discrimination between regions if necessary; [18] describes the techniques
we are exploring for this in one domain. The tendency of these six factors to correlate with visual focus
naturally maps to a neural net with the factors as inputs. The neural net should be trainable, since there are no
human experts to consult with on the proper weightings of the factors. The weights on the factors also need
adjusting to the domain and picture type within the domain because they can obviously vary significantly. For
example, for most NAWC-WD pictures, the fourth and fifth factors are very important, and the first factor is
quite unimportant because there are many occasions when the context in which a small object is embedded is more
important than the object. But process documentation pictures, type (4) of the last section, are often taken in a
hurry at NAWC-WD, and for them the first, fourth, and fifth factors must all be weighted lightly.
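
As a rough illustration of a trainable weighting of the six factors, the sketch below fits a single sigmoid unit
by gradient descent. It is an assumption-laden toy, not the network used in MARIE: the factor scores are
hand-made synthetic values in [0, 1] (not NAWC-WD data), and the single-unit architecture, learning rate, and
epoch count are arbitrary choices.

```python
import math

# Synthetic examples (NOT real data). Each row: six factor scores in the order
# given in the text (size, edge strength, color uniformity, not touching the
# border, centrality, uniqueness), then a label: 1 = visual focus, 0 = not.
EXAMPLES = [
    ([0.9, 0.8, 0.7, 1.0, 0.9, 0.8], 1),
    ([0.3, 0.9, 0.8, 1.0, 0.8, 0.9], 1),
    ([0.8, 0.2, 0.3, 0.0, 0.2, 0.1], 0),
    ([0.2, 0.3, 0.4, 0.0, 0.1, 0.2], 0),
]


def predict(weights, bias, factors):
    """Sigmoid of the weighted sum of the six factor scores."""
    z = bias + sum(w * x for w, x in zip(weights, factors))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, rate=0.5):
    """Fit the weights by stochastic gradient descent on cross-entropy loss."""
    weights, bias = [0.0] * 6, 0.0
    for _ in range(epochs):
        for factors, label in examples:
            error = predict(weights, bias, factors) - label
            weights = [w - rate * error * x for w, x in zip(weights, factors)]
            bias -= rate * error
    return weights, bias


weights, bias = train(EXAMPLES)
```

Retraining on examples from a different domain or picture type would yield a different weight vector, which is
the adjustment the paragraph above calls for.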

Another way to handle large captions derived from content analysis is to use supercaptions, captions describing
common features of sets of captions. Explicit supercaptions occur frequently with the NAWC-WD pictures for
sets of photographs taken of the same subject in the same picture-taking session. On querying, the supercaption
can be matched first to the user query, and if it passes, the full caption can be matched. Supercaptions can form
a hierarchy, possibly quite different from the type hierarchy. We have done some simple experiments using
supercaptions, with positive results.



4. Finding an efficient execution plan for a query 



One objection raised to natural-language processing for information retrieval [20] is that even if you can get the
parsing and meaning-list construction to be done quickly, you still have other problems, including a different
subgraph isomorphism problem, to solve in matching the query graph (or "meaning list") to candidate caption
graphs. The latter took an average of two seconds per query-caption pair on a Sun-4 workstation using a simple
algorithm in MARIE-1. Certainly the content-analysis methods of the last section can be slow. Furthermore,
multimedia data can be large and will usually be slow to retrieve under traditional database methods. We
believe that speed problems for multimedia retrieval can be significantly reduced by appropriate prior use of
"information filters", processes that rule out matches using simple polynomial-time criteria. We will assume
here that information filters guarantee perfect recall although not necessarily perfect precision; that is, they never
rule out an acceptable data match. Signature matching [7] is the most familiar information filter for multimedia
retrieval, but it can be done more than once for an application [3], and filters based on semantic or "intelligent"
criteria are also useful.

MARIE-1 got much power from "coarse-grain" filters that extracted nouns from the query and retrieved indexes
of captions that mentioned those nouns or their superconcepts (their generalizations in the type hierarchy). In
subsequent work, [15] reported significant power from a filter that assigns a set of possible categories to each
picture based on its intended purpose, and matches these to categories inferred for the query. [16] then reported
experiments with a "registration-data" filter to extract restrictions covered by the bookkeeping information for
each picture, information that can be stored separately in a relational database; the filter executes SQL queries
on this database, and rules out pictures based on the results.

[16] also develops mathematical criteria with proofs for local optimality conditions of execution plans of
information filters. These conditions can be evaluated in polynomial time, and can be the basis of a greedy
algorithm that experimentally demonstrated near-perfect success in finding the globally optimal sequence of a
conjunction of fifteen or fewer randomly-generated filters. These conditions derive from a decision-theoretic
processing-cost model of the expected cost of sending a data item through a conjunctive sequence of m filters:

$$c_1 + c_2\,p(f_1) + c_3\,p(f_1 \wedge f_2) + \cdots + c_m\,p(f_1 \wedge f_2 \wedge \cdots \wedge f_{m-1})$$

where $f_i$ is the event of passing filter i, $p(f_i)$ is the probability of passing filter i, and $c_i$ is the cost of
passing a data item through filter i. Then [16] gives a local optimality criterion against interchange of filters i
and i+1 in the conjunctive filter sequence, which holds if:

$$\frac{c_i}{1 - p(f_i \mid f_1 \wedge f_2 \wedge \cdots \wedge f_{i-1})} \le \frac{c_{i+1}}{1 - p(f_{i+1} \mid f_1 \wedge f_2 \wedge \cdots \wedge f_{i-1})}$$

and a local optimality criterion against deletion of redundant but fast filter i:

$$c_i + c_{i+1}\,p(f_i \mid f_1 \wedge \cdots \wedge f_{i-1}) + c_{i+2}\,p(f_i \wedge f_{i+1} \mid f_1 \wedge \cdots \wedge f_{i-1}) + \cdots + c_e\,p(f_i \wedge \cdots \wedge f_{e-1} \mid f_1 \wedge \cdots \wedge f_{i-1})$$

$$< c_{i+1} + c_{i+2}\,p(f_{i+1} \mid f_1 \wedge \cdots \wedge f_{i-1}) + \cdots + c_e\,p(f_{i+1} \wedge \cdots \wedge f_{e-1} \mid f_1 \wedge \cdots \wedge f_{i-1})$$

Dual criteria can be proved for disjunctive sequences, on the inverse of the probability involved.
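
The cost model and the interchange criterion can be tried out on a small example. The sketch below is a
simplification under stated assumptions, not the algorithm of [16]: it assumes the filters are statistically
independent (so joint probabilities are products), it requires every pass probability to be below 1, it ignores
the deletion criterion for redundant filters, and the four (cost, probability) pairs are made up.

```python
from itertools import permutations


def expected_cost(filters):
    """Expected per-item cost of applying (cost, pass_prob) filters in order,
    assuming independent filters."""
    total, reach = 0.0, 1.0  # reach = probability an item gets this far
    for cost, pass_prob in filters:
        total += cost * reach
        reach *= pass_prob
    return total


def greedy_order(filters):
    """Order filters by increasing cost / (1 - pass_prob), the interchange
    criterion specialized to independent filters (needs pass_prob < 1)."""
    return sorted(filters, key=lambda f: f[0] / (1.0 - f[1]))


def brute_force_order(filters):
    """Exhaustively find the cheapest ordering (feasible only for few filters)."""
    return min(permutations(filters), key=expected_cost)


# Hypothetical filters as (cost, probability of passing).
FILTERS = [(5.0, 0.5), (1.0, 0.8), (2.0, 0.2), (8.0, 0.05)]
greedy_cost = expected_cost(greedy_order(FILTERS))
best_cost = expected_cost(brute_force_order(FILTERS))
```

On this example the ordering produced by sorting on the ratio has the same expected cost as the ordering found by
trying all 24 permutations. With correlated filters the probabilities become conditional on the filters already
applied, and a single sort is no longer sufficient.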

Further local optimality conditions we prove in [16] are that distributive laws should be used to factor terms
whenever possible, and that DeMorgan's Laws should be used to push negations in as far as possible in the
boolean expression of the sequential filter execution plan. Finally, and most surprisingly, we proved it is never
locally optimal to have different information filters operating in parallel, no matter how many additional
processors are available, because the increased throughput does not compensate for the increased workload on each
filter. This proof makes only a broad assumption: that the cost, per unit number of data items, of n processors
doing a filter i is for some g where g''(n)<0 and g(0)=0. However, using multiple processors on
the same filter simultaneously is locally optimal under the same processing model, the approach of [22].

The above optimality analysis can be used to find a good consensus execution plan for information filtering for
an application, using means of costs and probabilities on a representative set of queries and captions, as we did
in [16]. But it can also be used to improve upon the consensus execution plan for a particular query at run time.
If we first apply the consensus execution plan to a small random sample of the input data, we can estimate
problem-specific values for costs and probabilities, and replan based on those. This is useful when there are
hidden correlations (conditional probabilities) between the words of a query. One application is deciding
whether to interleave index lookups for the particular nouns of the query with other more global analysis of the
query. For instance, for the query "AIM-9R on an aircraft", "aircraft" is very common in the NAWC-WD
captions, and AIM-9Rs are usually shown on aircraft; so the mathematical criteria will say that we ought to first
find pictures of AIM-9Rs, then do picture-type matching, and then check to see if the remaining candidate
captions mention an aircraft (and then do subgraph matching to confirm that the AIM-9R is on the aircraft and not
beside it).

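Replanning from a sample can be sketched as below. Everything here is hypothetical and for illustration only: the
eight sample captions, the three filter names, their costs, and the planning rule itself, which greedily picks the
filter with the lowest cost / (1 - pass rate), measuring each pass rate only on sample items that survived the
filters already chosen so that correlations between filters show up in the plan.

```python
def conditional_greedy_plan(filters, costs, sample):
    """Order filter names using pass rates measured on surviving sample items."""
    remaining, survivors, plan = dict(filters), list(sample), []
    while remaining:
        def rank(name):
            if not survivors:
                return costs[name]
            passed = sum(1 for item in survivors if remaining[name](item))
            rate = passed / len(survivors)
            # A filter that passes every survivor rules nothing out: rank it last.
            return costs[name] / (1.0 - rate) if rate < 1.0 else float("inf")
        best = min(remaining, key=rank)
        test = remaining.pop(best)
        plan.append(best)
        survivors = [item for item in survivors if test(item)]
    return plan


# Hypothetical sample of captions: (set of concepts mentioned, picture type).
SAMPLE = [
    ({"aircraft", "AIM-9R"}, "equipment"), ({"aircraft"}, "test"),
    ({"aircraft", "F-18"}, "equipment"), ({"tank"}, "test"),
    ({"aircraft", "AIM-9R"}, "public"), ({"aircraft"}, "equipment"),
    ({"building"}, "admin"), ({"aircraft"}, "test"),
]
FILTERS = {
    "mentions AIM-9R": lambda item: "AIM-9R" in item[0],
    "mentions aircraft": lambda item: "aircraft" in item[0],
    "picture type": lambda item: item[1] == "equipment",
}
COSTS = {"mentions AIM-9R": 1.0, "mentions aircraft": 1.0, "picture type": 2.0}
plan = conditional_greedy_plan(FILTERS, COSTS, SAMPLE)
```

In this toy sample every caption that mentions an AIM-9R also mentions an aircraft, so once the AIM-9R lookup has
been chosen the aircraft lookup is ranked last, and the plan becomes AIM-9R lookup, picture-type match, aircraft
lookup, mirroring the ordering in the example above.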
We confirmed the predictions of our theory of optimal execution plans in two quite different sets of experiments
reported in [16]. In one set of experiments we repeatedly generated random filter sets up to size 15, and
checked whether a greedy algorithm based on the above local optimality criteria could find the globally optimal
conjunctive sequence of those filters. We verified the global optimum by exhaustive search through all possible
filter sequences containing the required filters plus some improper subset of the redundant filters. Fig. 2 shows
typical results that we obtained, in this case for 13,000 experiments in which costs were evenly distributed on the
range 0 to 10 and probabilities were evenly distributed on the range 0.01 to 1.0. In Fig. 2 in particular, 0.2 was
the probability that a filter was redundancy-creating and 0.8 was the probability that a filter was redundant with
respect to some redundancy-creating filter later in a conjunctive sequence, parameters close to those of MARIE
and the variations on it that we have explored. The horizontal axis is the number of filters considered, and the
vertical axis is the mean of the logarithms of the output parameter indicated. It can be seen that the number of
local optima grows significantly more slowly than the size of the search space, the number of sequences
considered by exhaustive search. The ratio of the cost of the filter sequence found by our polynomial-time greedy
algorithm to the cost of the filter sequence found by exhaustive consideration of all possible sequences is very
close to unity. Thus even if this problem is exponential in time complexity in the worst case, simple
polynomial-time algorithms usually work so well that there is little reason to use anything else with 15 or fewer
filters.

The second set of experiments involved more detailed modeling of MARIE-1, using more detailed parameters
derived from 44 test queries, all but 2 of which were supplied by naive users of the existing NAWC-WD system.
We estimated cost and probability parameters by running each filter separately on the database of 217 captions
used in [17]. We then confirmed that the actual performance of our prototype system on the 44 queries
was very close to that predicted by theory. For instance, comparing cost of filters without the picture-type
matcher to cost with it, we observed a ratio of 1.18 with a standard deviation of 0.43 versus a predicted ratio of
1.33; and in comparing cost of filters without the keyword matcher to cost with it, we observed a ratio of 22.1
with a standard deviation of 17.3 versus a theoretical ratio of 29.7. In the first comparison, the theoretical
optimum was optimal in all but 9 of the 44 cases, and in the second comparison, the theoretical optimum was
optimal in all 44 cases. These experiments are encouraging. We hope to do further experiments, and explore
more filters and more complicated filters.